Abstract 

Background: Babesia bovis is an apicomplexan parasite that causes babesiosis in infected cattle. Genomes of 
pathogens contain promising information that can facilitate the development of methods for controlling infections. 
Although the genome of B. bovis is publicly available, annotated gene models are not highly reliable prior to 
experimental validation. Therefore, we validated a preproposed gene model of B. bovis and extended the 
associated annotations on the basis of experimentally obtained full-length expressed sequence tags (ESTs). 

Results: From in vitro cultured merozoites, 12,286 clones harboring full-length cDNAs were sequenced from both 
ends using the Sanger method, and 6,787 full-length cDNAs were assembled. These were then clustered, and a 
nonredundant referential data set of 2,115 full-length cDNA sequences was constructed. The comparison of the 
preproposed gene model with our data set identified 310 identical genes, 342 almost identical genes, 1,054 genes 
with potential structural inconsistencies, and 409 novel genes. The median length of 5' untranslated regions (UTRs) 
was 152 nt. Subsequently, we identified 4,086 transcription start sites (TSSs) and 2,023 transcriptionally active regions 
(TARs) by examining 5' ESTs. We identified ATGGGG and CCCCAT sites as consensus motifs in TARs that were 
distributed around -50 bp from TSSs. In addition, we found ACACA, TGTGT, and TATAT sites, which were distributed 
periodically around TSSs in cycles of approximately 150 bp. Moreover, related periodical distributions were not 
observed in mammalian promoter regions. 

Conclusions: The observations in this study indicate the utility of integrated bioinformatics and experimental data 
for improving genome annotations. In particular, full-length cDNAs with one-base resolution for TSSs enabled the 
identification of consensus motifs in promoter sequences and demonstrated clear distributions of identified motifs. 
These observations allowed the illustration of a model promoter composition, which supports the differences in 
transcriptional regulation frameworks between apicomplexan parasites and mammals. 
Background 

Bovine babesiosis is a parasitic infection caused by a proto- 
zoan of the genus Babesia, order Piroplasmida, phylum 
Apicomplexa. Babesia bovis and Babesia bigemina are 
major species that impose a considerable economic burden 
on cattle industries because of their wide geographical dis- 
tribution and pathogenicity [1]. The clinical symptoms of 
B. bovis are more serious than those of B. bigemina, includ- 
ing fever, extensive erythrocyte lysis leading to anemia, 
icterus, hemoglobinuria, and death. Although antiparasitic 
drugs such as imidocarb successfully control these symptoms
[2], they have severe side effects and may promote the
emergence of resistant strains and leave chemical residues. 
Therefore, safer chemical agents and vaccinations are 
required. 

In general, the genome is an excellent tool for under- 
standing all life forms. Unique genes and pathways that 
are elucidated from genomes are often recognized as tar- 
gets for chemical or vaccine development. Because the 
genome sequence of B. bovis is publicly available [3], it 
may offer promising information for the development of 
novel approaches for controlling parasitic infections. According
to a previous bioinformatics study, the B. bovis 
genome encodes 3,671 nuclear protein-coding genes. 
However, estimated gene models based on bioinformat- 
ics lack accuracy in nonmodel organisms. Inconsist- 
encies in gene models have been reported between 
bioinformatics estimates and experimental observations 
of apicomplexan parasites [4,5]. Therefore, to improve 
reliability, gene models require verification with experi- 
mental evidence. 

The acquisition of mRNA sequences is one of the most 
straightforward strategies for verifying gene models. Spe- 
cifically, full-length cDNA libraries facilitate the identifica- 
tion of transcription start sites (TSSs), exon and intron 
structures, 5' and 3' untranslated regions (UTRs), and 
polyadenylation sites. Moreover, massive sets of TSSs can 
be used to identify transcriptionally active regions (TARs), 
which are closely related to promoter regions [6,7]. There- 
fore, the determination of full-length cDNA transcrip- 
tomes is critical for revisions of gene models, and for 
elucidation of transcriptional mechanisms. 

In this study, we collected 5' and 3' expressed sequence
tags (ESTs) from full-length cDNAs of B. bovis 
that were synthesized using the oligo-capping method 
[8]. In brief, cap structures at the 5' ends of mRNAs were
replaced with a synthetic linker RNA, and the resulting
chimeric RNA was used to synthesize cDNA with fixed 5'
transcript sequences. This cDNA was then sequenced, and
the data were used to update the gene model and to identify
novel genes. In addition, consensus sequences around 
TSSs and putative DNA cis-elements for transcriptional 
control were identified by comparison with promoter
regions identified in genome-wide analyses. 

Results and discussion 

Construction and analysis of full-length cDNA 

A total of 12,286 clones were randomly selected for plasmid
extraction (Table 1). Subsequent one-pass sequencing 
from 5' and 3' ends using the Sanger method produced
9,573 and 10,956 sequences, respectively (DDBJ: 
HX874250-HX894778). After assembly of paired 5' and 
3' ESTs using CAP3, 7,797 pairs were successfully 
merged into single contigs; one-pass sequences of
poor quality and genes with long transcripts failed to
assemble and were excluded. Finally, 6,787 sequences passed 
the filter for coding capacity and were selected. These 
were annotated and redundancy was eliminated, resulting 
in 2,115 full-length cDNA sequences (DDBJ: AK440354- 
AK442468), including 1,706 cDNAs that corresponded 
with preproposed gene models in PiroplasmaDB, and 409 
newly annotated genes (Table 1 and Additional file 1: 
Table S1). Among the 409 newly annotated genes, 134 
showed sufficient homology to genes of other apicomplexan
parasites (Additional file 1: Table S1B). In addition, 
features of these 134 cDNA sequences were sufficiently 
similar to those of the other gene sets (Additional file 2: 
Table S2), indicating that they may be newly identified 
protein coding transcripts. Among these, numbers of 
the genes with multiple exons and average exon num- 
bers per gene were higher than those in other gene 
models (Additional file 2: Table S2), indicating that 
genes with multiple exons are relatively difficult to predict 
from genome sequences and are prone to misannotation. In 
contrast, 273 cDNA sequences with little homology showed 
unique features. Specifically, the median coding sequence 
(CDS) length was shorter, as indicated by the smaller num- 
bers of genes with multiple exons and longer median exon 
lengths than those in other gene sets (Additional file 2: 
Table S2). These observations suggest that certain parts 
of the transcripts identified in this EST analysis are 
noncoding RNA, or were derived from genomic DNA 
as artifacts. Nonetheless, promising protein coding 
cDNA sequences with large CDS lengths and multiple 
exons such as XBBk025260.contig, XBBk029358.contig, 
and XBBk014264.contig remained in this gene set. These 
B. bovis-specific novel genes may have B. bovis-specific 
functions in proliferation and host-parasite interactions.
In general, gene finding algorithms such as GlimmerHMM
[9] require training data sets for better 
prediction. Although training data sets for model organisms
have been constructed using experimental data, 
available Babesia spp. training data sets are limited,
which may explain the observed discrepancies between 
experimentally observed cDNAs and preproposed gene 
models. Because a degree of consistency was observed 
between the 1,706 full-length cDNA sequences and 
preproposed annotations, we performed genome and 
amino acid alignments of these sequences (Table 1). In 
these analyses, 310 sequences were identical to preproposed
genes, whereas 342 were almost identical but with 
amino acid substitutions that probably originated from 
sequencing errors or polymorphisms among strains. 
The remaining 1,054 sequences had partial homology 
to existing annotations, although they had structural 
inconsistencies that may reflect the alternative usage of 
start codons and/or splicing. 

Table 1 Summary of ESTs and contigs 

                                     Number   Accession number
Total number of isolated clones      12286
5' one-pass sequence                 9573     DDBJ: HX874250-HX894778
3' one-pass sequence                 10956
Contig sequence                      6787
Non-redundant contig sequence 1)     2115     DDBJ: AK440354-AK442468
  Identical 2)                       310
  Amino acid variant 3)              342
  Structural variant 4)              1054
  Assigned in this study 5)          409

1) Nonredundant contig sequences were selected from the contig sequences. Identical, amino acid variant, structural variant, and assigned sequences are subsets of the nonredundant 
contig sequences. 2) Contig sequences with coding sequences identical to the preproposed gene model (ppgm); 3) Contig sequences with almost identical 
coding sequences but amino acid variant(s) derived from single nucleotide variant(s); 4) Contig sequences with structural differences from the ppgm, as assigned 
in this study; 5) Contig sequences not described in the ppgm. 

The 5' UTRs that lie between TSSs and first in-frame 
initiation codons are known to play crucial roles in post- 
transcriptional regulation by modulating translational ef- 
ficiency and mRNA stability through the actions of IRES 
and riboswitches [10,11]. This mechanism is observed in 
a wide variety of organisms, including humans, plants, 
and yeast [12-14], suggesting that apicomplexan para- 
sites have similar functions. However, these functions 
have been poorly investigated. Therefore, to elucidate 
the functions of the 5' UTRs of B. bovis, we constructed 
a genome-wide 5' UTR sequence data set using full- 
length cDNA sequences and demonstrated that the 
median length of the 5' UTRs of B. bovis is 152 nts. The 
average 5' UTRs are 210.2 nts in humans, 186.3 nts in 
rodents, 221.9 nts in invertebrates, 103.0 nts in viridi- 
plantae, and 134.0 nts in fungi [15] and the mode length 
is approximately 130 nts in Toxoplasma gondii [16]. 
These lengths agree with our observations in B. bovis. 
Similarly, the median length of the 3' UTRs of B. bovis 
is 116 nts (Additional file 2: Table S2). 

Gene expression frequencies are also indicated in EST 
data. Therefore, we examined the data set of 9,573 5' ESTs
and selected 9,546 sequences following successful mapping
onto the B. bovis genome. To estimate expression 
frequencies, these were then mapped onto preproposed 
CDSs together with the novel sequences identified in this study 
(Additional file 3: Table S3 and Additional file 4: Figure S1). 
The resulting ranking was not identical to that in a previous
study of ESTs [17], although it showed similar 
tendencies. These discrepancies may reflect differences 
in culture conditions and parasite strains or sampling 
errors associated with small data sets. Logarithmic plots of 
expression levels and ranks of each gene resembled a 
power law (Additional file 4: Figure S1) and indicated 
similar transcriptome distributions to those observed in 
previous studies [18,19]. 
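
As an illustration of the rank-frequency analysis described above, the following minimal sketch fits a straight line to log10(count) against log10(rank); a slope near a negative constant on this log-log scale is what "resembling a power law" means. The function name and the idealized toy counts are assumptions made for illustration and are not data or code from this study.

```python
import math

def rank_frequency_slope(counts):
    """Least-squares fit of log10(count) against log10(rank).

    counts: EST counts per gene (any order); zero counts are ignored.
    Returns (slope, intercept) of the fitted line.
    """
    ranked = sorted((c for c in counts if c > 0), reverse=True)
    xs = [math.log10(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log10(c) for c in ranked]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical, idealized values that follow count = 1000 / rank exactly.
toy_counts = [1000 / rank for rank in range(1, 101)]
slope, intercept = rank_frequency_slope(toy_counts)
```

For the idealized input the fitted slope is -1 and the intercept is 3 (up to floating-point error); real EST counts would scatter around such a line.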

B. bovis promoter components and typical structure 

Transcription is controlled by the coordinated binding 
of transactivators to promoter sequences. In humans 
and model organisms, promoter structures have been 
intensively examined in a genome-wide manner [20-22] 
and have been shown to play pivotal roles in gene and 
phenotype expression. However, the promoter structure 
of Babesia spp. remains unknown. Therefore, we characterized
the promoter structure of B. bovis using high 
resolution TSS information derived from a full-length 
cDNA data set. 

Genome-wide TSS distributions were examined by 
mapping the 5' ends of 5' ESTs. Briefly, 9,412 reliable 5' 
end sequences of 36 nts were selected from 9,573 5' 
ESTs. Of these, 7,111 were successfully mapped onto 
the B. bovis genome sequence, 4,086 locations were 
assigned as TSSs after considering redundantly mapped 
sequences, and 2,023 TARs were identified. 
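
The grouping of TSSs into TARs can be sketched as follows, using the rule given in Methods (TSSs within 20 nts are clustered, and the most frequently mapped TSS represents each TAR). This is a minimal sketch and not the in-house script: the function name, the single-linkage chaining of neighbouring positions, the tie-breaking rule, and the toy coordinates are all assumptions, and the input is taken to be positions on one strand of one contig.

```python
from collections import Counter

def cluster_tss(positions, max_gap=20):
    """Group mapped 5'-end positions into TARs.

    positions: mapped 5'-end coordinates, one entry per read.
    Returns a list of (representative_tss, read_count, distinct_positions).
    """
    counts = Counter(positions)
    groups = []
    current = []
    for pos in sorted(counts):
        # Start a new TAR when the gap to the previous TSS exceeds max_gap.
        if current and pos - current[-1] > max_gap:
            groups.append(current)
            current = []
        current.append(pos)
    if current:
        groups.append(current)

    tars = []
    for members in groups:
        # Most frequently mapped TSS; ties go to the smallest coordinate (assumption).
        representative = max(members, key=lambda p: (counts[p], -p))
        total = sum(counts[p] for p in members)
        tars.append((representative, total, members))
    return tars

# Hypothetical mapped 5' ends.
reads = [100, 100, 100, 105, 118, 300, 301, 301]
tars = cluster_tss(reads)
# tars == [(100, 5, [100, 105, 118]), (301, 3, [300, 301])]
```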

We selected motifs in the -10 to +10 regions of TSSs 
from the TAR data set and examined these using MEME 
[23]. The estimated consensus sequence TYAYWWW 
was found in 801 of the 2,023 TARs, with p values 
of <0.05 (Table 2 and Figure 1). We also examined the 
positional distribution of this motif around TSSs. Exami- 
nations of sequences around TSSs (-100 to +100 region) 
showed that the motif was distributed only on TSSs 
(Figure 2). Moreover, adenine residues at TSSs and cyto- 
sine residues at the -1 position were clearly conserved 
and +3 to +5 positions tended to be thymidine, as shown 
in T. gondii [7]. This CA motif was also conserved in 
initiator consensus sequences from vertebrates [24] and 
dicotyledonous plants [25], despite differences in the 
methods for identifying consensus and diversity of subject
species. Data sets for B. bovis and T. gondii were 
collected from single organisms, whereas the data sets 
from vertebrates and plants were collected from multiple 
organisms. According to molecular recognition analyses, 
TAF1 and TAF2 play pivotal roles in initiator recognition [26-28]. 
In Plasmodium falciparum, PFL1645w and MAL7P1.134 
are promising functional homologues of TAF1 and 
TAF2, respectively, as predicted using bioinformatics 
methods [29]. Moreover, their corresponding genes 
BBOV_IV004260 and BBOV_II003570 were annotated 
in the B. bovis genome, implying that initiator recognition
and TSSs have evolved with closely related molecular
mechanisms across taxonomic kingdoms. 

Table 2 FWM for B. bovis initiator-like motifs from 801 TSSs 

Position from TSSs   -2    -1    0*    +1    +2    +3    +4
A                    142   0     801   109   270   308   234
T                    411   245   0     326   480   255   325
G                    104   0     0     133   0     127   137
C                    144   556   0     233   51    111   105
consensus            T     Y     A     Y     W     W     W

*Position of TSS. 

Figure 1 Consensus sequences on TSSs. Asterisks indicate TSSs. Panels: B. bovis, T. gondii, vertebrates, and plants. 
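
The consensus row of Table 2 can be reproduced from the counts with a simple rule. In the sketch below, a base is kept at a position when it accounts for at least 25% of the 801 sites, and the kept bases are converted to an IUPAC symbol. The 25% threshold is an assumption chosen for illustration; the text does not state the rule that was actually used.

```python
# Counts copied from Table 2 (positions -2 to +4 relative to the TSS).
FWM = {
    "A": [142, 0, 801, 109, 270, 308, 234],
    "T": [411, 245, 0, 326, 480, 255, 325],
    "G": [104, 0, 0, 133, 0, 127, 137],
    "C": [144, 556, 0, 233, 51, 111, 105],
}

# IUPAC nucleotide codes keyed by the alphabetically sorted set of bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AG": "R", "CT": "Y", "CG": "S", "AT": "W", "GT": "K", "AC": "M",
    "CGT": "B", "AGT": "D", "ACT": "H", "ACG": "V", "ACGT": "N",
}

def consensus(fwm, min_fraction=0.25):
    """IUPAC consensus keeping bases with frequency >= min_fraction (assumed rule)."""
    n_positions = len(next(iter(fwm.values())))
    symbols = []
    for i in range(n_positions):
        total = sum(fwm[base][i] for base in fwm)
        kept = sorted(b for b in fwm if fwm[b][i] / total >= min_fraction)
        symbols.append(IUPAC["".join(kept)])
    return "".join(symbols)

column_totals = [sum(FWM[b][i] for b in FWM) for i in range(7)]
motif = consensus(FWM)  # "TYAYWWW"
```

Every column sums to 801, which confirms that the reconstructed table is internally consistent.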

In subsequent analyses, we searched for cis-elements
that are involved in transcriptional control. To generate a 
putative promoter set, -1000 to +1000 regions from typical
TSSs of the 2,023 TARs were selected and examined 
using CisFinder [30], and -100 to 0 regions were examined
using MEME [23]. These analyses showed frequent 
distribution of ATGGGG and ACACA within promoter 
regions. 

To validate the ATGGGG motif, we examined pos- 
itional distributions of these candidates around TSSs and 
found a clear peak at 50 nts upstream (Figure 3A). Further
investigations of the reciprocal (reverse-complement) sequence CCCCAT 
showed equivalent distribution to that of ATGGGG 
(Figure 3B), implying that the motif may be functional 
regardless of its direction. The CCCCAT motif has been 
identified in Theileria parva and Theileria annulata 
using comprehensive promoter analyses. Although the reciprocal
motif ATGGGG was not examined in these 
species, the CCCCAT peak was found at -20 nts from TSSs,
differing slightly from our observations [31]. In further
investigations, we examined functional enrichments of genes 
carrying these promoter motifs, and identified genes 
corresponding to the 2,023 TARs by calculating relative 
distances. Subsequently, 1,315 TARs were found with 
candidate initiation codons. Among these, 222 TARs 
had the ATGGGG or CCCCAT motifs in the -80 to -20 
region from TSSs. Subsequent enrichment analyses using 
gene ontology terms from GOstat [32] indicated signifi- 
cant enrichment in "structural constituent of ribosome" 
(GO:0003735) and "translation" (GO:0009058) categories, 
with E-values of 3.43e"^^ and 2.06e"^^, respectively. En- 
richments of protein synthesis have also been reported for 
the CCCCAT motif in T. parva and T. annulata [31], suggesting
that the motif may be conserved in piroplasms as 
a transcriptional regulator of genes involved in protein 
synthesis. 
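
The selection of TARs that carry ATGGGG or its reverse complement CCCCAT in the -80 to -20 region can be sketched as below. The helper names and toy sequences are hypothetical, and the sketch assumes that a motif is located by its first base; the text does not specify how motif positions were anchored.

```python
def reverse_complement(seq):
    """Reverse complement of an unambiguous DNA sequence."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq))

def has_motif_in_window(promoter, tss_index, motif, start=-80, end=-20):
    """True if the motif, in either orientation, starts within [start, end] of the TSS.

    promoter: sequence on the transcribed strand; tss_index: 0-based index of the TSS.
    """
    targets = {motif, reverse_complement(motif)}
    k = len(motif)
    for rel in range(start, end + 1):
        i = tss_index + rel
        if i >= 0 and promoter[i:i + k] in targets:
            return True
    return False

# Hypothetical promoters with the TSS at index 100.
with_motif = "T" * 50 + "CCCCAT" + "T" * 44 + "ACGT"   # motif starts at -50
too_close = "T" * 90 + "ATGGGG" + "T" * 4 + "ACGT"     # motif starts at -10
hit = has_motif_in_window(with_motif, 100, "ATGGGG")
miss = has_motif_in_window(too_close, 100, "ATGGGG")
```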

Figure 2 Distribution of the TYAYWWW motif around TSSs. The horizontal axis represents sequence areas from -100 to +100 around 
TSSs with 1-nt resolution. Position 0 represents TSS. A peak was observed at the -2 position. The vertical axis represents the ratio of the motif 
frequency and the theoretical frequency (see Methods). 

Figure 3 Distributions of motifs around TSSs. Horizontal axes 
represent sequence areas from -1000 to +1000 from TSSs. Vertical 
axes represent the ratio of motif frequencies and theoretical 
frequencies (see Methods). (A) Distributions of the ATGGGG motif; 
(B) Distribution of the CCCCAT motif; (C) Distributions of the ACACA 
(blue), TGTGT (red), and TATAT (green) motifs; (D) Distributions of 
the other 5-mer motifs; (E) Distributions of the ACACA (blue), TGTGT 
(red), and TATAT (green) motifs in humans; (F) Distributions of all 
5-mer repeat combinations in humans; (G) Distributions of the known 
apicomplexan motifs GATTCC, GCATGC, GTGCAC, and TAGCTA and the 
core mammalian promoter motifs TATA, TATAWAAR; BREu, SSRCGCC; 
BREd, RTDKKKK; and DPE, RGWYVT. In A, B, and G, the scanning 
window was 10 bp. In C-F, the scanning window was 30 bp. 

To validate the ACACA motif, we examined the positional
distribution of these candidates and found periodical
distribution around TSSs (Figure 3C). The reciprocal 
sequence TGTGT was also periodically distributed, but its 
phase was shifted (Figure 3C). Based on these observations,
we examined all 5-mer repeat motifs comprising 
two nucleotides and found periodical distribution of 
TATAT as an additional motif, with similar cycles
but differing phases (Figure 3C). The related motifs 
CACAC, GTGTG, and ATATA also showed similar distributions
(data not shown). The only other combinations 
that showed distinguishing distributions were GGGGG, 
CCCCC, GCGCG, and CGCGC, with peaks around -50 
nts (Figure 3D). GGGGG and CCCCC motifs are closely 
related to ATGGGG and CCCCAT motifs, respectively. 
However, GCGCG and CGCGC motifs may be functional,
and gene ontology enrichment analyses showed frequent 
but insignificant presence of these in upstream promoter
regions of ribonucleoprotein complex biogenesis 
(GO:0022613) genes (p = 0.078). To confirm the speci- 
ficity of these motifs for apicomplexan parasites, we 
examined periodical distributions in the TSS database 
DBTSS, which contains precise positions of TSSs in the 
genomes of various organisms [33]. Promoter regions 
from -1000 to +1000 of human and mouse TSSs were 
obtained and the distributions of ACACA, TGTGT, and 
TATAT motifs were examined as in B. bovis. However, 
no periodical distributions were found in human (Figure 3E) 
or mouse (data not shown) databases, and no related 
periodical distributions of other combinations were 
observed as in B. bovis (Figure 3F). In contrast, the 
ACACA motif was reportedly observed in T. parva and 
P. falciparum [31,34], although periodical distributions 
have not been reported. Rather than reflecting the dif- 
ferences in species, these discrepancies may have been 
caused by differences in the precision of TSS identifica- 
tion. Nonetheless, these observations imply that the 
motif is common among some apicomplexan parasites, 
and the present periodical patterns had interval lengths 
of 140-150 nts. Canonical nucleosome repeat units comprise
147 bp of DNA wrapped around a core histone octamer plus
a 20-bp DNA linker and are thus longer than the intervals
observed here. However, previous studies 
demonstrate that the minimum observed nucleosome 
repeat length, approximately 155 bp in Schizosaccharomyces
pombe and Aspergillus nidulans [35-37] and in
P. falciparum [38,39], is much closer to our observation. 
On the other hand, these discrepancies may reflect the 
involvement of unconventional nucleosome structures. 
The conventional histone octamer comprises two H2A-H2B
dimers and an H3-H4 tetramer. In contrast, unconventional
nucleosomes containing histone variants such as H2B.Z, 
H2A.Z, and CENP-A have specific functions that are 
distinguishable from those of the conventional one. Crystal
structure analysis of human centromeric nucleosomes
containing CENP-A suggests that only 121-bp DNA fragments 
tightly bind to nucleosomes, unlike conventional H3 
nucleosomes [40]. In P. falciparum, it was demonstrated 
that the nucleosome with H2A.Z specifically localizes to 
intergenic regions [41,42]. Moreover, no homologue to 
the linker histone H1 has been identified in apicomplexan
parasites [43]. FAIRE-seq and MAINE-seq analyses
in P. falciparum demonstrated that nucleosome 
binding to TSSs is associated with gene expression [44] 
and there are preferred DNA motifs for nucleosome 
assembly [45,46]. These collateral data warrant the 
assumption that the observed periodical patterns in this 
study are involved in chromatin structure and regulate 
gene expression via chromatin remodeling processes. 
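
The text does not state how the interval length of the periodical patterns was measured. One common way to estimate such a period, shown here only as a sketch under that caveat, is to take the lag that maximizes the autocorrelation of the observed/theoretical ratio profile. The function names and the synthetic 150-nt cosine profile are assumptions for illustration.

```python
import math

def autocorrelation(values, lag):
    """Autocorrelation of a profile at the given lag (normalized by total variance)."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values)
    covariance = sum(
        (values[i] - mean) * (values[i + lag] - mean) for i in range(n - lag)
    )
    return covariance / variance

def dominant_period(values, min_lag, max_lag):
    """Lag in [min_lag, max_lag] with the highest autocorrelation."""
    return max(range(min_lag, max_lag + 1), key=lambda lag: autocorrelation(values, lag))

# Synthetic ratio profile over -1000..+1000 oscillating with a 150-nt period.
profile = [1.0 + 0.3 * math.cos(2 * math.pi * pos / 150) for pos in range(-1000, 1001)]
period = dominant_period(profile, 50, 300)
```

On this noise-free input the estimate falls at or next to 150; on real profiles the autocorrelation peak would be broader, which is consistent with reporting a range such as 140-150 nts.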

In further analyses, we applied this scanning method 
to known apicomplexan and mammalian core promoter 
motifs. A previous study showed the distribution of 
GATTCC in T. parva and T. annulata at regions that 
are -20 nts from TSSs [31]. Moreover, GCATGC was 
identified as a PF14_0633-binding target in P. falciparum
[47], an AP2-Sp-binding target in Plasmodium 
berghei [48], and a Toxoplasma Ribosomal Protein 
(TRP)-2-binding target in T. gondii [49]. GTGCAC is 
known as a subtelomeric variant gene promoter element
(SPE)-2 [49] and a binding target of PFF0200c_DLD 
and PfSIP2 [47,50,51]. TAGCTA is also reportedly a binding
target of PbAP2-O [52]. When these motifs were scanned
in B. bovis, the GTGCAC motif was moderately concentrated
in the -50-nt area, whereas the other motifs did not show
distinguishing distributions in comparison with the
ATGGGG/CCCCAT motif (Figure 3G). In particular, GATTCC
was not specifically distributed around TSSs, in contrast
to observations in T. parva and T. annulata, indicating
that the motif is specific to Theileria spp. and may be
involved in specific biological phenotypes, such as
infectivity in lymphocytes. Regarding mammalian motifs,
we examined TATA boxes, upstream TFIIB-recognition
elements (BREu), downstream BRE (BREd), and downstream
promoter elements (DPE), containing the consensus sequences
TATAWAAR, SSRCGCC, RTDKKKK, and RGWYVT, respectively [20]. 
In these analyses, TATA boxes showed periodical-like 
patterns (Figure 3G). In contrast, TATA boxes are known 
to be distributed around -30 nts from TSSs (Figure 3E), 
and TATA box consensus sequences are closely related to 
TATAT and ATATA motifs. These are also periodically 
distributed, suggesting that the observed pattern for TATA 
boxes is a by-product of the TATAT and ATATA periodicity
and that no functional motif corresponding to the TATA box
exists in B. bovis. These observations also suggest that
the other mammalian motifs are nonfunctional in B. bovis
(Figure 3G). 
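
Degenerate consensus sequences such as TATAWAAR or RGWYVT use IUPAC nucleotide codes, so scanning for them requires expanding each symbol into the bases it permits. A minimal sketch using regular expressions follows; the helper names and example sequences are hypothetical and this is not the scanning script used in the study.

```python
import re

# Standard IUPAC nucleotide codes expressed as regex character classes.
IUPAC_CLASSES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def iupac_to_pattern(motif):
    """Translate an IUPAC motif into a regular-expression string."""
    return "".join(IUPAC_CLASSES[symbol] for symbol in motif)

def find_sites(motif, sequence):
    """0-based start positions of all matches, including overlapping ones."""
    # A lookahead lets the regex engine report overlapping occurrences.
    pattern = re.compile("(?=(" + iupac_to_pattern(motif) + "))")
    return [match.start() for match in pattern.finditer(sequence)]

tata_sites = find_sites("TATAWAAR", "GGTATAAAAGCC")  # [2]
inr_sites = find_sites("TYAYWWW", "CCTCACTTTGG")     # [2]
```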

Collectively, we propose a model of promoter structure
and transcriptional mechanisms in B. bovis that explains
our observations (Figure 4). First, we identified 
the TSS initiator-like motif TYAYWWW. In other taxonomic
kingdoms, this initiator works as a binding site 
for the general transcription factors TAF1 and TAF2 
[20], and previous in silico analyses demonstrate that 
apicomplexan parasites express homologs of TAF1 and 
TAF2 [29]. Therefore, B. bovis may also use this molecular
mechanism at the final step of transcriptional initiation,
as described previously in P. falciparum [53]. 
As in T. gondii and the majority of other organisms, the 
typical length of 5' UTRs was approximately 150 nts
(median, 152 nts), suggesting similar involvement in the
regulation of gene expression. Periodical distributions
of ACACA, TGTGT, and TATAT were observed 
around TSSs. However, this profile was not observed in 
human and mouse (Figure 3), and previous studies indicate
that transcriptional mechanisms differ between apicomplexan
parasites and other eukaryotes to a certain 
degree [43,53,54]. In particular, we assumed that the 
periodical distributions are involved in tight assembly 
of nucleosome structures and control transcription, 
although discrepancies in nucleosome repeat lengths 
remain to be clarified by additional experimental evidence.
On the other hand, we observed clear peak 
distributions of ATGGGG and CCCCAT at -50-bp 
regions from TSSs. Although it remains unclear how 
this motif functions regardless of orientation, chromatin
remodeling factors may be recruited to loosen 
nucleosome structures. Therefore, the scheme shown 
in Figure 4 proposes transcriptional arrest by histones 
and subsequent activation by putative chromatin remodeling
factors that interact with ATGGGG or CCCCAT 
elements. 

Figure 4 Schematic representation of the composition and 
speculative nucleosome structure of a model B. bovis 
promoter. Angled arrows, thick lines, big circles, and boxes 
represent TSSs, DNA, histones, and coding sequences, respectively. 
Motifs in promoter regions are represented by small circles and the 
corresponding sequences are indicated by dashed arrows. The 
TYAYWWW motif over TSS is shown in Figures 1 and 2, and Table 2. 
The median length from TSS to the 5' end of CDS (5' UTR) was 
152 bp (Additional file 2: Table S2). The ATGGGG motif and its 
reciprocal CCCCAT are distributed around -50 nt from TSS 
(Figure 3A and B). The ACACA, TGTGT, and TATAT motifs appear 
every 150 bp (Figure 3C). Positional relationships among the motifs 
and histones are arbitrarily described in this illustration. 

Previous investigations of Plasmodium and Toxo- 
plasma demonstrate promoter structures [43,53-55], 
putative DNA cis-elements [34,44,50,54,56-58], and the 
involvement of chromatin structures in transcription 
[44,53,55,59]. The present analyses of Babesia parasites 
were largely consistent with these studies and warrant 
extending these concepts to Babesia species.
Nonetheless, the use of fine TSS mapping is a 
critical distinction between the present and previous 
studies and allowed more specific and sensitive assess- 
ment of the distribution of examined motifs, particu- 
larly for ACACA, TGTGT, and TATAT motifs that lack 
definition in previous studies [34,50]. Therefore, the 
present analyses indicate that the distance from TSSs 
may be a critical factor for the functionality of DNA
cis-elements in apicomplexan parasites. 
Conclusions 

The full-length cDNA data set enabled us to revise 
the previous gene model derived from the genome. In 
parallel, location-specific consensus motifs in promoter
sequences were discovered by virtue of TSS 
identification with the one-base resolution of the method. 
These observations 1) indicate the utility of integrated 
bioinformatics and experimental data for improving 
genome annotations and 2) allowed the illustration of 
a model promoter composition, which supports the 
differences in transcriptional regulation frameworks 
between apicomplexan parasites and mammals. 

Methods 

Preparation of parasite RNA, and synthesis and 
sequencing of cDNA 

The Texas strain of B. bovis was maintained in bovine 
erythrocytes cultured in GIT medium (WAKO, Osaka, 
Japan) using a microaerophilic stationary-phase culture 
system [60]. Total RNA was extracted from B. bovis- 
infected erythrocytes using TRIzol (Invitrogen), and 
cDNA was synthesized using a previously described oligo- 
capping method [61]. Briefly, 200 μg of purified total 
RNA was dephosphorylated using bacterial alkaline phosphatase
and was ligated to the oligo-RNA 5'-AGCAU 
CGAGUCGGCCUUGUUGGCCUACUGG-3' using T4 
RNA ligase. Subsequently, cDNA was synthesized using 
the oligo-dT fusion primer 5'-GCGGCTGAAGACGGC 
CTATGTGGCCTTTTTTTTTTTTTTTTT-3' with
Superscript® II (Invitrogen). The cDNA library was amplified 
using PCR with the primers 5'-AGCATCGAGTCGGC 
CTTGTTG-3' and 5'-GCGGCTGAAGACGGCCTATG 
T-3'. Amplified fragments were then digested using SfiI 
and were ligated into a DraIII-digested pME18SFL3 plasmid
vector in an orientation-defined manner. ESTs of 5' 
and 3' ends were obtained using the Sanger method with 
ABI 3730 sequencers following standard protocols for 
sequencing analysis. 

Assembly, clustering, and annotation of ESTs 

To obtain full-length cDNA sequences, 5' and 3' ESTs 
were assembled using CAP3 [62] with default parameters.
The overlap length cutoff was 40 nt
and the overlap identity cutoff was 90% for the 5' and 
3' ESTs. Putative CDSs of full-length cDNA 
were examined and intact CDSs of >50 amino acids were 
selected. Amino acid homology with preproposed genes 
(BbovisT2BoAnnotatedProteins_PiroplasmaDB-1.1.fasta) 
was examined using BLAST and homology was consid- 
ered significant when E-values were 10"^^. Full-length 
cDNA sequences were also mapped onto the genome of 
B. bovis (BbovisT2BoGenomic_PiroplasmaDB-1.1.fasta) 
using BLAT [63] with default parameters. The categories 
of full-length cDNA sequences (tier 1, identical; tier 2,
amino acid variant; tier 3, structural variant; and tier 4,
novel) were assigned according to BLAST and BLAT results
using an in-house script. Copy DNAs that were 
mapped to positions of identical nucleic acid sequences 
of preproposed genes were assigned as "identical", and 
those mapped to identical positions but with amino acid 
substitutions were assigned as "amino acid variants". 
Copy DNAs that were mapped to similar but nonidentical 
positions to homologous preproposed genes were assigned 
as "structural variants", and those with poor homology 
were assigned as "novel". Full-length cDNAs were clus- 
tered using CAP3 with default parameters, and the subset 
of cDNAs in the same cluster were integrated into the 
highest tier. Among full-length cDNAs with differing 
sequences and the same tier status, cDNAs with fewer 
mutations and longer amino acids or nucleic acids were 
selected, and novel cDNAs were annotated using Blast2GO
[64]. To estimate gene expression, 5' ESTs were 
examined. Initially, these were filtered by BLAST using 
the B. bovis genome database (BbovisT2BoGenomic). 
Subsequently, the filtered sequences were mapped onto 
both CDS (BbovisT2BoAnnotatedCDS) and novel se- 
quences using BLAST, and the frequencies of mapped 
ESTs were determined. 
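
The tier rules described above can be summarized in a short sketch. The in-house script is not available, so the `Hit` record, its boolean fields, and the function names are hypothetical simplifications of the BLAST and BLAT results; in particular, how cDNAs with only synonymous nucleotide differences were handled is not stated, and the sketch places them in tier 2.

```python
from dataclasses import dataclass

TIERS = {1: "identical", 2: "amino acid variant", 3: "structural variant", 4: "novel"}

@dataclass
class Hit:
    """Hypothetical summary of BLAST/BLAT results for one full-length cDNA."""
    has_homolog: bool     # significant BLAST hit to a preproposed protein
    same_structure: bool  # maps to the same position as the preproposed gene
    identical_cds: bool   # coding sequence identical to the preproposed gene

def assign_tier(hit):
    """Tier 1-4 following the rules in the text (simplified)."""
    if not hit.has_homolog:
        return 4
    if not hit.same_structure:
        return 3
    return 1 if hit.identical_cds else 2

def cluster_tier(hits):
    """A cluster is integrated into the highest tier (smallest number) of its members."""
    return min(assign_tier(hit) for hit in hits)

example_cluster = [Hit(True, False, False), Hit(True, True, False)]
example_label = TIERS[cluster_tier(example_cluster)]  # "amino acid variant"
```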

Identification of promoter regions and prevalent motifs 

To identify TSSs for each 5' EST, 36 nts were clipped 
from 5' ends and removed if they contained ambiguous 
nucleotides such as N. Selected sequence sets were then 
mapped onto the genome sequence of the B. bovis T2Bo 
strain [3] using Bowtie [65] with the acceptance of two 
mismatches. To assign TARs, mapped positions corre- 
sponding with TSSs were clustered if two TSSs were po- 
sitioned within 20 nts. Gross mapped counts for each 
position and TAR were tallied, and the most frequently 
mapped TSS in each TAR was assigned as a representative 
TSS, as defined in previous reports [33,66]. To identify 
motifs on TSSs, -10 to +10 regions from representative 
TSSs were selected and examined using MEME [23]. A 
frequency weight matrix (FWM) was calculated on the 
basis of sequence motifs with p values of <0.05, and a 
sequence logo was generated on the basis of FWM using 
WebLogo [67]. As a putative promoter region, sequences 
comprising -1000 to +1000 regions from representative 
TSSs of each TAR were selected. Human and mouse pro- 
moter regions were obtained from DBTSS [33]. Candidate 
DNA motifs were then estimated using CisFinder [30] 
with default parameters and -1000 to +1000 regions from 
representative TSSs. These were also estimated using 
MEME with nsites of 2023, a maximum data set size of 
250000, and maxw of 8 as parameters in -100 to 0 regions 
of representative TSSs. The distributions of identified 
motifs around peak TSSs were examined by scanning the 
motif over the promoter using an in-house script. The 
positions of each motif were scanned in exact match 
condition and summed for every 10 bases. Observed 
frequencies were divided by the theoretical frequency 
based on nucleotide biases that were estimated from the 
nucleotide composition of the genome. The distributions 
of 5-mer motif candidates were smoothed by averaging 
the surrounding 30 bases. For functional enrichment analyses,
we selected TARs with motifs from the initial set
of 2,023 TARs. Genes and TARs were linked if 
TARs were present in the -500 to +200 region from the 
5' end of the CDS, as defined in the BbovisT2BoAnnotatedCDS.
Finally, gene ontology terms for B. bovis genes 
were annotated using Blast2GO [64]. 
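
The motif scan described above (exact matches, counts summed per 10-base bin, observed frequency divided by a theoretical frequency) can be sketched as follows. This is not the in-house script: the function names are hypothetical, all promoters are assumed to have equal length and to be aligned on the TSS, and the theoretical frequency is assumed to be the product of independent genome-wide base frequencies.

```python
def expected_start_probability(motif, base_freq):
    """Probability that the motif starts at a position, assuming independent bases."""
    probability = 1.0
    for base in motif:
        probability *= base_freq[base]
    return probability

def motif_ratio_profile(promoters, motif, base_freq, bin_size=10):
    """Observed/expected motif counts per bin across aligned promoter sequences."""
    length = len(promoters[0])
    k = len(motif)
    n_starts = length - k + 1
    n_bins = (n_starts + bin_size - 1) // bin_size
    observed = [0] * n_bins
    scanned = [0] * n_bins  # number of start positions examined per bin
    for seq in promoters:
        for i in range(n_starts):
            b = i // bin_size
            scanned[b] += 1
            if seq[i:i + k] == motif:
                observed[b] += 1
    p = expected_start_probability(motif, base_freq)
    return [obs / (n * p) for obs, n in zip(observed, scanned)]

# Toy input: two 30-nt sequences and a uniform base composition (both hypothetical).
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
toy_promoters = ["T" * 10 + "ACACA" + "T" * 15, "T" * 30]
profile = motif_ratio_profile(toy_promoters, "ACACA", uniform)
```

A ratio above 1 in a bin indicates that the motif occurs there more often than the base composition alone would predict; the 30-base smoothing applied to the 5-mer motifs would be a further moving average over this profile.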

Availability of supporting data 

Supporting sequence data are available in the DDBJ 
(http://www.ddbj.nig.ac.jp/index-e.html) under acces- 
sion numbers HX874250-HX894778 and AK440354- 
AK442468. 